perm filename PATREC[4,KMC]12 blob sn#109206 filedate 1974-07-03 generic text, type T, neo UTF8
     

00100	
00200	
00300	      PATTERN-MATCHING RULES FOR THE RECOGNITION OF
00400	         NATURAL LANGUAGE DIALOGUE EXPRESSIONS
00500	
00600			Kenneth Mark Colby
00700	                      and
00800		        Roger C. Parkison
00900	
01000	
01100		            INTRODUCTION
01200	
01300		To recognize something is to identify it as  an  instance  of
01400	the "same again".   This familiarity is possible because of recurrent
01500	characteristics of  the  world  which  repeat  themselves.  We  shall
01600	describe  an  algorithm which recognizes recurrent characteristics of
01700	natural language dialogue  expressions.  It  utilizes  a  multi-stage
01800	sequence  of pattern-matching rules for progressively transforming an
01900	input expression until  it  eventually  matches  an  abstract  stored
02000	pattern.   The stored pattern has a pointer to a response function in
02100	memory which decides what to do once the input has  been  recognized.
02200	Here  we  discuss  only  the  recognizing  functions,  except for one
02300	response function (anaphoric substitution) which  interactively  aids
02400	the  recognition  process.    Details  of  how the response functions
02500	operate will be described in a future communication by William Faught
02600	and ourselves.
02700		We are constructing and  testing  a  simulation  of  paranoid
02800	thought  processes;  our  problem is to reproduce paranoid linguistic
02900	behavior  in  a  teletyped  diagnostic  psychiatric  interview.   The
03000	diagnosis   of  paranoid  states,  reactions  or  modes  is  made  by
03100	clinicians who judge the degree of correspondence between  what  they
03200	observe  in  an  interview  and  their  conceptual  model of paranoid
03300	behavior.   There  exists  a   high   degree   of   agreement   among
03400	psychiatrists about this conceptual model which relies mainly on what
03500	an interviewee says and how he says it.
03600		Natural  language  is a life-expressing code which people use
03700	for  communication  with  themselves  and  others.   In  a  real-life
03800	dialogue  such  as  a  psychiatric  interview,  the participants have
03900	interests, intentions, and expectations which are revealed  in  their
04000	linguistic  expressions.     An  interactive simulation of a paranoid
04100	patient must be  able  to  demonstrate  typical  paranoid  linguistic
04200	behavior.   To  achieve this effect, our paranoid model must have the
04300	ability to deal with the teletyped messages of an interviewer.
04400		A number of approaches have been taken for dealing with
04500	natural language dialogue expressions (Winograd, 1972; Woods, 1970).
04600	These  approaches  rely on parsers which conduct a detailed syntactic
04700	and semantic analysis. They perform well for the purposes  for  which
04800	they  were  designed. Their weakness, for our purposes, lies in their
04900	lack of mechanisms for neglecting and ignoring. Such mechanisms are
05000	necessary  in  a  program  which accepts and responds to unrestricted
05100	conversational English characterized  by  expressions  novel  to  the
05200	program.
05300		How humans process natural language is largely unknown.  They
05400	possess  some  knowledge of grammatical rules, but this fact does not
05500	entail  that  they  use  a  grammar  in  interpreting  and  producing
05600	language.  It  seems  implausible  to  us  that  people  possess full
05700	transformational grammars for processing  language.      Language  is
05800	what  is  recognized but the processes involved may not be linguistic
05900	or grammatical.      Originally transformational  grammars  were  not
06000	designed  to "understand" a large subset of English; they constituted
06100	a formal method for deciding whether a string is grammatical.
06200		An  analysis  of  what one's problem actually is should guide
06300	the selection or invention of methods appropriate  to  its  solution.
06400	Our  problem  is  not  to  develop a consistent and general theory of
06500	language nor to assert  empirically  testable  hypotheses  about  how
06600	people  process  language.    Our  problem  is to design an algorithm
06700	which recognizes what is being said in a dialogue and what  is  being
06800	said  about  it in order to make a response such that a sample of I-O
06900	pairs from the paranoid model is judged similar to a  sample  of  I-O
07000	pairs   from   paranoid  patients.     The  design  task  belongs  to
07100	artificial  intelligence  in  which  the  criterion  is  whether  the
07200	computer program performs mind-like functions.  New methods had to be
07300	devised for an algorithm to participate in  a  human  dialogue  in  a
07400	paranoid-patient-like  way.   We sought effective methods which could
07500	operate  efficiently  in  real  time.     Since our method provides a
07600	general way of many-to-one mapping  from  surface  expressions  to  a
07700	single  stored  pattern,  it  is  not  limited  to  the simulation of
07800	paranoia, but can be used by any type of "host"  system  which  takes
07900	natural language as input.
08000		Our method is to transform  the  input  until  a  pattern  is
08100	obtained which matches completely or partially a more abstract stored
08200	pattern.  This strategy  has  proved  adequate  for  our  purposes  a
08300	satisfactory  percentage  of the time.   The power of this method for
08400	natural  language  dialogues  lies  in  its  ability  to  ignore   as
08500	irrelevant  some  of  what  it  recognizes and everything it does not
08600	recognize at all.  A linguistic parser doing word-by-word,
08700	parts-of-speech analysis fails when it cannot find one or more of the
08800	input words in its dictionary. A system that must know every word  is
08900	too fragile for unrestricted dialogues.
09000		In early versions of the paranoid model, such as PARRY1, some
09100	of  the  pattern  recognition  mechanisms allowed the elements of the
09200	pattern to be order independent (Colby, Weber, and Hilf, 1971).   For
09300	example, consider the following expressions:
09400		(1) WHERE DO YOU WORK?
09500		(2) WHAT SORT OF WORK DO YOU DO? 
09600		(3) WHAT IS YOUR OCCUPATION? 
09700		(4) WHAT DO YOU DO FOR A LIVING? 
09800		(5) WHERE ARE YOU EMPLOYED? 
09900	In  PARRY1  a  procedure  scans  these  expressions  looking  for  an
10000	information-bearing  contentive  such as "work", "for a living", etc.
10100	When it finds such a contentive along with "you"  or  "your"  in  the
10200	expression,  regardless  of word order, it responds to the expression
10300	as if it were a question about the nature of one's work. This  method
10400	correctly  classifies  the  five  sentences  above. Unfortunately, it
10500	includes the two examples below in the same category:
10600		(6) DOES YOUR FATHER'S CAR WORK?
10700		(7) HOW DID THINGS WORK OUT FOR YOU?
10800	An  insensitivity  to word order has the advantage that lexical items
10900	representing  different  parts  of  speech  can  represent  the  same
11000	concept, e.g. the word "work" represents the same concept whether it
11100	is used as a noun or a verb. But a price is paid for this resilience and
11200	elasticity.   We  find  from  experience  that,  since English relies
11300	heavily on word order to convey the  meaning  of  its  messages,  the
11400	average   penalty  of  misunderstanding  (to  be  distinguished  from
11500	ununderstanding), is too great. Hence in PARRY2, as will be
11600	described shortly, all the patterns require a specified word order.
11700		For  high-complexity  problems  it   is   helpful   to   have
11800	constraints.        Diagnostic psychiatric interviews (and especially
11900	those conducted over teletypes)  have  several  natural  constraints.
12000	First,  clinicians  are  trained  to ask certain questions in certain
12100	ways.    This limits the number of  patterns  required  to  recognize
12200	utterances  about  each  topic.  Second,  only a few hundred standard
12300	topics are brought up by interviewers who are,  furthermore,  trained
12400	to  use everyday expressions and especially those used by the patient
12500	himself.  When the interview is conducted by  teletypes,  expressions
12600	tend  to  be  shortened  since  the interviewer tries to increase the
12700	information transmission rate over the slow channel  of  a  teletype.
12800	Finally,   teletyped  interviews  represent  written  utterances  and
12900	utterances are known to be highly redundant  such  that  unrecognized
13000	words  can be ignored without losing the meaning of the message. Also
13100	utterances are loaded with idioms, cliches, pat phrases, etc. - all
13200	being easy prey for a pattern-matching approach. It is
13300	time-wasting and  usually  futile  to  try  to  decode  an  idiom  by
13400	analyzing the meanings of its individual words.
13500		We  now  describe  the  pattern-matching  functions  of   the
13600	algorithm  in  some  detail. (See Fig. 1 for a diagram of the overall
13700	flow of control).
13800	
13900	
14000			    OVERVIEW
14100	
14200		PARRY2 has  two  primary  modules.   The  first  attempts  to
14300	RECOGNIZE the input and the second RESPONDS.  This paper is primarily
14400	about the  RECOGNIZE  module.   It  functions  independently  of  the
14500	RESPOND  module  except  in the case of pronoun references, which the
14600	RESPOND module provides to the RECOGNIZER on request.
14700		The recognition module has 4 main steps:
14800		1) Identify the words in the question and convert them to
14900			internal synonyms.
15000		2) Break the input into segments at certain bracketing words.
15100		3) Match each segment (independently) to a stored pattern.
15200		4) Match the resulting list of recognized segments to a stored
15300			  compound pattern.
15400	Each  of  these  steps,  except  the  segmenting, throws away what it
15500	cannot identify.  Occasionally a reference to  an  unknown  topic  is
15600	mis-recognized as some familiar topic.
15700	
15800	 		    PREPROCESSING
15900	
16000		Each word in the input expression is first  looked  up  in  a
16100	dictionary of (currently) about 1900 entries which, for the sake of
16200	speed, is maintained in core during run-time.  Illustrative  examples
16300	of  dictionary entries are given in Appendix 2. The dictionary, which
16400	was built empirically from thousands  of  teletyped  interviews  with
16500	previous  versions  of the model, consists of words, groups of words,
16600	and names of word-classes they can be translated into.  The  size  of
16700	the  dictionary  is  determined  by  the patterns, i.e. a word is not
16800	included unless it appears in one or more patterns.  Entries  in  the
16900	dictionary  reflect  PARRY2's main interests.  If a word in the input
17000	is not in the dictionary,  it  is  dropped  from  the  pattern  being
17100	formed. Thus if the input is:
17200		WHAT IS YOUR CURRENT OCCUPATION?
17300	and  the  word  "current"  is   not in the dictionary, the pattern at
17400	this stage becomes:
17500		( WHAT IS YOUR OCCUPATION )   
17600	The question-mark is thrown away as  redundant  since  questions  are
17700	recognized  by  word  order. (A statement followed by a question mark
17800	(YOU GAMBLE?) is responded to in  the  same  way  as  that  statement
17900	followed  by  a  period.) Synonymic translations of words are made so
18000	that the pattern becomes, for example:
18100		( WHAT  BE  YOU  JOB )   
18200		Some groups of words (i.e. idioms) are translated as a  group
18300	so that, for example, "for a living" becomes "for job". Certain other
18400	juxtaposed words are contracted into a single word,  e.g.  "place  of
18500	birth"  becomes  "birthplace".  This  is  done to deal with groups of
18600	words which are  represented  as  a  single  element  in  the  stored
18700	pattern,  thereby preventing segmentation from occurring at the wrong
18800	places, such as at a preposition inside an idiom or  phrase.  Besides
18900	these  contractions, certain expansions are made so that for example,
19000	"DON'T" becomes "DO NOT" and "I'D" becomes "I WOULD".
19100		Misspellings  can  be the bane of teletyped interviews for an
19200	algorithm.  Here  they  are  handled  in  two  ways.  First,   common
19300	misspellings  of  important  words  are simply put in the dictionary.
19400	Thus "yoy" is known to mean "you". The apostrophe  is  often  omitted
19500	from contractions so most contractions are recognized with or without
19600	it. These common misspellings were gathered from over 4000 interviews
19700	with earlier versions of the paranoid model.
19800		Second,  five  common  forms  of  typing  error  are  checked
19900	systematically.  These are:
20000		1) Doubled letter
20100		2) Extraneous letter
20200		3) Forgetting to hold the "shift key" for an apostrophe
20300		4) Hitting a nearby key on the keyboard
20400		5) Transposing two letters in a word
20500	The first three errors can be corrected  by  deleting  the  offending
20600	character  from  the  word.    This  is accomplished by deleting each
20700	character in turn until the word is recognized.  The fourth  type  of
20800	error is only checked for 10 of the more common near misses.    These
20900	were also empirically determined and include letter pairs like (T Y),
21000	(O  P),  (A  S),  and  (N M).   These methods are all based on typing
21100	errors, but  they  also  correct  some  legitimate  English  spelling
21200	errors.    Two-letter  transposition corrects, for example, "beleive"
21300	to "believe".
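The five error checks can be sketched as follows. The word list and near-miss pairs are toy stand-ins (the paper lists (T Y), (O P), (A S), and (N M) among its ten empirically chosen pairs); this is our reconstruction, not PARRY2's code.

```python
# Illustrative spelling correction. Deleting a character in turn
# handles errors 1-3 (doubled letter, extraneous letter, missed
# shift key); near-miss substitution handles error 4; adjacent
# transposition handles error 5. KNOWN is a toy stand-in.
KNOWN = {"YOU", "BELIEVE", "HOSPITAL"}
NEAR_MISSES = [("T", "Y"), ("O", "P"), ("A", "S"), ("N", "M")]

def correct(word):
    if word in KNOWN:
        return word
    # Errors 1-3: delete each character until the word is recognized.
    for i in range(len(word)):
        w = word[:i] + word[i + 1:]
        if w in KNOWN:
            return w
    # Error 4: substitute the other member of a near-miss key pair.
    for i, ch in enumerate(word):
        for a, b in NEAR_MISSES:
            repl = b if ch == a else a if ch == b else None
            if repl and word[:i] + repl + word[i + 1:] in KNOWN:
                return word[:i] + repl + word[i + 1:]
    # Error 5: transpose adjacent letters.
    for i in range(len(word) - 1):
        w = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if w in KNOWN:
            return w
    return None  # unrecognized; the word will be dropped

print(correct("BELEIVE"))  # BELIEVE, via two-letter transposition
```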
21400	
21500	                      SEGMENTING
21600	
21700		Another  weakness  in the crude pattern matching of PARRY1 is
21800	that it takes the entire input expression  as  its  basic  processing
21900	unit.   If  only two words are recognized in an eight word utterance,
22000	the risk of misunderstanding is great. We need a way of dealing  with
22100	units shorter than the entire input expression.
22200		Aided by a heuristic from work in machine-translation (Wilks,
22300	1973), we devised a way of bracketing the pattern constructed up to
22400	this  point  into  shorter  segments  using  prepositions,  wh-forms,
22500	certain  verbs, etc. as bracketing points.  (A list of the bracketing
22600	terms  appears  in  Fig.  2).    These  points   tend   to   separate
22700	prepositional  phrases  and  embedded  clauses  from the main clause.
22800	The  new  pattern  formed  is  termed  either  "simple",  having   no
22900	delimiters  within  it,  or "compound", i.e., being made up of two or
23000	more simple patterns. A simple pattern might be:
23100		( WHAT BE YOU JOB )
23200	whereas a compound pattern would be:
23300		(( WHY BE YOU ) ( IN HOSPITAL ))
23400	Our experience with this method of segmentation shows  that  compound
23500	patterns  from teletyped psychiatric dialogues rarely consist of more
23600	than three or four segments. 
23700		After certain verbs (See  Fig.  4)  a  bracketing  occurs  to
23800	replace the commonly omitted "THAT", such that:
23900		( I THINK YOU BE AFRAID )
24000	becomes
24100		(( I THINK ) ( YOU BE AFRAID ))
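A minimal sketch of this segmenter, assuming toy delimiter and verb lists in place of those in Figs. 2 and 4:

```python
# Illustrative segmenter: the preprocessed pattern is broken into
# simple patterns at delimiter words (prepositions, wh-forms, etc.),
# and after certain verbs to replace a commonly omitted "THAT".
# Both word lists are toy stand-ins for Figs. 2 and 4.
DELIMITERS = {"WHY", "WHAT", "HOW", "IN", "FOR", "TO"}
THAT_VERBS = {"THINK", "BELIEVE", "KNOW"}

def segment(pattern):
    segments, current = [], []
    for i, word in enumerate(pattern):
        # Bracket at a delimiter, except at the start of a segment.
        if word in DELIMITERS and current:
            segments.append(current)
            current = []
        current.append(word)
        # Bracket after certain verbs (the omitted "THAT").
        if word in THAT_VERBS and i + 1 < len(pattern):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

print(segment(["WHY", "BE", "YOU", "IN", "HOSPITAL"]))
# [['WHY', 'BE', 'YOU'], ['IN', 'HOSPITAL']]
print(segment(["I", "THINK", "YOU", "BE", "AFRAID"]))
# [['I', 'THINK'], ['YOU', 'BE', 'AFRAID']]
```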
24200	
24300			   MATCHING INDIVIDUAL SEGMENTS
24400	
24500		Conjunctions serve only as markers for the segmenter and they
24600	are dropped out after segmentation.
24700		Negations are  handled  by  extracting  the  "NOT"  from  the
24800	segment  and  assigning  a value to a global variable which indicates
24900	that the expression is negative in form. When a  pattern  is  finally
25000	matched,  this variable is consulted. Some patterns have a pointer to
25100	a pattern  of  opposite  meaning  if  a  "NOT"  could  reverse  their
25200	meanings.      If this pointer is present and a "NOT" was found, then
25300	the pattern matched is replaced by its opposite, e.g.  ( I not  trust
25400	you  ) is replaced by the pattern ( I mistrust you ). We have not yet
25500	observed the troublesome  case  of  "he  gave  me  not  one  but  two
25600	messages". (There is no need to scratch where it doesn't itch).
25700		Substitutions  are also made in certain cases.  Some segments
25800	contain pronouns which could stand for a number of  different  things
25900	of  importance  to PARRY2.       As we mentioned in the introduction,
26000	the response functions of memory keep track of the context  in  order
26100	to  give  pronouns and other anaphoras a correct interpretation.  For
26200	example, the segment:
26300		( DO YOU AVOID THEM )
26400	could refer to the Mafia, or racetracks, or other patients, depending
26500	on the context.  When such a segment is encountered,  the  pronoun  is
26600	replaced by its current anaphoric value as determined by the response
26700	functions, and a more specific segment such as:
26800		( DO YOU AVOID MAFIA )
26900	is  looked  up.
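The substitution itself is a straightforward replacement once the response functions have supplied the referent; in this sketch the current anaphoric values are passed in as an argument rather than obtained from PARRY2's memory.

```python
# Illustrative anaphoric substitution: each pronoun in the segment
# is replaced by its current value. The referent mapping would come
# from the response functions; here it is supplied directly.
def substitute_anaphora(segment, referents):
    return [referents.get(w, w) for w in segment]

print(substitute_anaphora(["DO", "YOU", "AVOID", "THEM"], {"THEM": "MAFIA"}))
# ['DO', 'YOU', 'AVOID', 'MAFIA']
```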
27000		Other  utterances,  such  as  "Why  did you do that?" or just
27100	"Why?" (which might be regarded as a massive ellipsis), clearly refer
27200	back  to  previous  utterances.   These utterances match very general
27300	patterns which identify the type of question without  indicating  the
27400	exact  topic. The response function which responds to "Why?" consults
27500	the context to produce an appropriate answer.
27600		The algorithm next attempts to match the segments with stored
27700	simple patterns which  currently  number  about  1700.  (Examples  of
27800	simple  patterns  appear in Appendix 3). First a complete and perfect
27900	match is sought.   When a match is found, the stored pattern name has
28000	a  pointer to the name of a response function in memory which decides
28100	what to do further.  If a match is not found, further transformations
28200	of the segment are carried out and a "fuzzy" match is tried.
28300		For fuzzy matching at this stage, we  adopted  the  heuristic
28400	rule of dropping elements in the segment one at a time and attempting
28500	a match each time.   This heuristic allows ignoring familiar words in
28600	unfamiliar  contexts.    For example, "well" is important in "Are you
28700	well?" but meaningless in "Well are you?".
28800	 	Deleting one element at a time results  in,  for example, the
28900	pattern:
29000			( WHAT BE YOU MAIN PROBLEM )
29100	becoming successively:
29200			(a) ( BE YOU MAIN PROBLEM )
29300			(b) ( WHAT YOU MAIN PROBLEM )
29400			(c) ( WHAT BE MAIN PROBLEM )
29500			(d) ( WHAT BE YOU PROBLEM )
29600			(e) ( WHAT BE YOU MAIN )
29700	Since the stored pattern in this case matches (d), (e) would  not  be
29800	constructed.   We  found  it  unwise  to delete more than one element
29900	since our segmentation method usually yields  segments  containing  a
30000	small number (1-4) of words.
30100		Dropping  an  element  at  a  time  provides  a   probability
30200	threshold  for  fuzzy   matching which is a function of the length of
30300	the segment. If a segment consists of five elements, four of the five
30400	must be present in a particular order (with the fifth element missing
30500	in any position) for a match to occur. If  a  segment  contains  four
30600	elements, three must match - and so forth.
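The perfect-then-fuzzy matching of a single segment can be sketched as follows, with a toy two-pattern store standing in for PARRY2's roughly 1700 simple patterns and their response-function pointers:

```python
# Illustrative segment matching: a complete and perfect match is
# sought first; failing that, elements are dropped one at a time
# and the match retried. STORED is a toy stand-in mapping patterns
# to response-function names.
STORED = {("WHAT", "BE", "YOU", "JOB"): "topic-employment",
          ("WHAT", "BE", "YOU", "PROBLEM"): "topic-problem"}

def match_segment(segment):
    seg = tuple(segment)
    if seg in STORED:               # complete and perfect match
        return STORED[seg]
    # At most one element is deleted, since segments usually
    # contain only a small number (1-4) of words.
    for i in range(len(seg)):
        fuzzy = seg[:i] + seg[i + 1:]
        if fuzzy in STORED:
            return STORED[fuzzy]
    return None                     # no match: default condition

print(match_segment(["WHAT", "BE", "YOU", "MAIN", "PROBLEM"]))
# topic-problem
```

Dropping "MAIN" (case (d) in the paper's example) produces the stored pattern, so later deletions are never constructed.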
30700	
30800			   COMPOUND-PATTERN MATCH
30900	
31000		When more than one simple pattern is detected in the input, a
31100	second  matching  is  attempted  against about 500 compound patterns.
31200	Certain patterns, such as ( HELLO ) and (  I  THINK  ),  are  dropped
31300	because  they  are considered meaningless. If a complete match is not
31400	found, then simple patterns are dropped, one  at  a  time,  from  the
31500	complex pattern. This allows the input,
31600		(( HOW DO YOU COME ) ( TO BE ) ( IN HOSPITAL ))
31700	to  match  the  stored  pattern,
31800		(( HOW  DO  YOU COME ) ( IN HOSPITAL )).
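The compound match works the same way one level up, as this sketch shows; the meaningless-pattern list and the single stored compound pattern are toy stand-ins for PARRY2's roughly 500 entries.

```python
# Illustrative compound match: meaningless simple patterns are
# dropped, then the segment list is matched against stored compound
# patterns, dropping one segment at a time on failure.
MEANINGLESS = {("HELLO",), ("I", "THINK")}
COMPOUND = {(("HOW", "DO", "YOU", "COME"),
             ("IN", "HOSPITAL")): "why-hospitalized"}

def match_compound(segments):
    segs = tuple(tuple(s) for s in segments
                 if tuple(s) not in MEANINGLESS)
    if segs in COMPOUND:
        return COMPOUND[segs]
    # Drop simple patterns one at a time from the compound pattern.
    for i in range(len(segs)):
        reduced = segs[:i] + segs[i + 1:]
        if reduced in COMPOUND:
            return COMPOUND[reduced]
    return None                     # default condition

print(match_compound([["HOW", "DO", "YOU", "COME"],
                      ["TO", "BE"], ["IN", "HOSPITAL"]]))
# why-hospitalized
```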
31900	
32000		If  no  match  can  be found at this point, the algorithm has
32100	arrived at a default condition and the appropriate response functions
32200	decide  what  to  do.  For example, in a default condition, the model
32300	may assume  control  of  the  interview,  asking  the  interviewer  a
32400	question, continuing with the topic under discussion or introducing a
32500	new topic.
32600		An annotated example of a diagnostic psychiatric interview is
32700	presented in Appendix 1.
32800	
32900	
33000			   ADVANTAGES AND LIMITATIONS
33100	
33200		As   mentioned,   one   of   the   main   advantages   of   a
33300	pattern-matching strategy is that it can ignore  as  irrelevant  both
33400	some  of  what  it  recognizes and what it does not recognize at all.
33500	There are several million words in English, each possessing from  one
33600	to  over  a  hundred  senses.     To  construct a machine-usable word
33700	dictionary of this magnitude is out of the  question  at  this  time.
33800	Recognition  of  natural language input in the manner described above
33900	allows real-time interaction in a dialogue since it  avoids  becoming
34000	ensnarled   in  combinatorial  disambiguations  and  long  chains  of
34100	inferencing  which  would  slow  a   dialogue   algorithm   down   to
34200	impracticality,  if it could even function at all. The price paid for
34300	pattern-matching is that  sometimes,  but  rarely,  ambiguities  slip
34400	through.
34500		A drawback to PARRY1 is  that it reacts to the first pattern
34600	it  finds  in the input rather than characterizing the input as fully
34700	as possible and then deciding what to do based on a number of  tests.
34800	Another  practical  difficulty  with  PARRY1  from   a   programmer's
34900	viewpoint, is that elements of the patterns are strung out in various
35000	procedures throughout the algorithm.    It is  often  a  considerable
35100	chore for the programmer to determine  whether  a  given  pattern  is
35200	present  and  precisely where it is.   In PARRY2 the patterns are all
35300	collected in one part of the  data-base  where  they  can  easily  be
35400	examined.
35500		Concentrating all the patterns in the data base gives  PARRY2
35600	a  limited  "learning"  ability.   When  an  input fails to match any
35700	stored pattern or matches an incorrect one,  as  judged  by  a  human
35800	operator,  a  pattern  which  matches  the  input can be put into the
35900	data-base automatically.  If the new pattern has the same meaning  as
36000	a previously stored pattern, the human operator must provide the name
36100	of the appropriate response function.  If  he  doesn't  remember  the
36200	name,  he  may  try  to  rephrase the input in a form recognizable to
36300	PARRY2 and it will name the response  function  associated  with  the
36400	rephrasing.  These mechanisms are not "learning" in the commonly used
36500	sense but they do allow a  person  to  transfer  his  knowledge  into
36600	PARRY2's data-base with very little  effort.
36700		Informal  observation  thus  far  shows  PARRY2's  linguistic
36800	recognition  abilities  to  be  quite  superior  to  PARRY1's. A more
36900	systematic and quantitative evaluation of performance  is  now  being
37000	carried out. PARRY1 was extensively tested by having judges make
37100	ratings of its performance along several dimensions, one of which was
37200	linguistic noncomprehension (Colby and Hilf, 1974). These judges also
37300	made ratings of teletyped interviews with  psychiatric  patients  and
37400	with  a  random version of PARRY1. Once the ratings of PARRY2 along a
37500	linguistic dimension are completed, we will be able to compare them with
37600	the  ratings of PARRY1 and measure the improvement of the model along
37700	this dimension.
37800	
37900	
38000	                  REFERENCES
38100	
38200	Colby, K.M., Weber, S., and Hilf, F.D. (1971). Artificial Paranoia.
38300		ARTIFICIAL INTELLIGENCE, 2, 1-25.
38400	
38500	Colby, K.M. and Hilf, F.D. (1974). Multidimensional Evaluation
38600	       of a Computer Simulation of Paranoid Thought. To appear
38700	       in KNOWLEDGE AND COGNITION, L. Gregg (Ed.).
38800	Wilks, Y. (1973). An Artificial Intelligence Approach to Machine
38900		Translation. In COMPUTER MODELS OF THOUGHT AND 
39000		LANGUAGE, R.C.Schank and K.M. Colby, Eds., W.H. Freeman,
39100		San Francisco.
39200	
39300	Winograd, T. (1972). UNDERSTANDING NATURAL LANGUAGE. Academic
39400		Press, New York.
39500	
39600	Woods, W.A. (1970). Transition Network Grammars for Natural
39700		Language Analysis. COMMUNICATIONS OF THE ACM, 13, 591-606.